Analysis


Songs on TikTok on both the Billboard and Spotify Charts

What songs were popoular on TikTok that also appeared on both the Spotify and Billboard charts?

By excluding the NA values, this gives a dataframe of eleven songs that appear on all three charts.

# make a new dataframe with popular songs on TikTok that also appear on other charts
songs_on_all_charts <- combined_cleaned[complete.cases(combined_cleaned[ , c('rank_s', 'rank_t', 'rank_b')]), ]
# make the table look nice
gt(songs_on_all_charts) |>
  tab_header(
    title="Popular Songs on TikTok appearing on Other Charts"
  )
Popular Songs on TikTok appearing on Other Charts
song_title artist rank_s rank_t rank_b tempo
Heat Waves Glass Animals 1 89 1 80.870
As It Was Harry Styles 1 96 2 173.930
Glimpse of Us Joji 1 97 52 169.914
Late Night Talking Harry Styles 2 93 25 114.996
First Class Jack Harlow 2 89 6 107.005
Bad Habits Ed Sheeran 3 22 13 126.011
Woman Doja Cat 6 89 21 107.998
Need to Know Doja Cat 6 85 18 130.041
About Damn Time Lizzo 9 92 12 108.965
Bad Habit Steve Lacy 10 86 28 168.946
I Hate U SZA 19 79 50 106.703

This is essentially the same as the R version – excluding null values results in the songs that appeared on all charts.

SELECT *,
FROM combined_cleaned
WHERE rank_s IS NOT NULL 
  AND rank_t IS NOT NULL 
  AND rank_b IS NOT NULL
Displaying records 1 - 10
song_title artist rank_s rank_t rank_b tempo
Heat Waves Glass Animals 1 89 1 80.870
Bad Habits Ed Sheeran 3 22 13 126.026
Need to Know Doja Cat 6 85 18 130.041
Glimpse of Us Joji 1 97 52 169.914
About Damn Time Lizzo 9 92 12 108.966
Late Night Talking Harry Styles 2 93 25 114.996
First Class Jack Harlow 2 89 6 106.998
I Hate U SZA 19 79 50 106.703
As It Was Harry Styles 1 96 2 173.930
Woman Doja Cat 6 89 21 107.998

Again, this excludes the null values.

all_charts = combined[combined[['rank_s', 'rank_t', 'rank_b']].notnull().all(1)]

GT(all_charts) # makes the table look nice
song_title artist rank_s rank_t rank_b tempo
About Damn Time Lizzo 9.0 92.0 12.0 108.965
As It Was Harry Styles 1.0 96.0 2.0 173.93
Bad Habit Steve Lacy 10.0 86.0 28.0 168.946
Bad Habits Ed Sheeran 3.0 22.0 13.0 126.011
First Class Jack Harlow 2.0 89.0 6.0 107.005
Glimpse of Us Joji 1.0 97.0 52.0 169.914
Heat Waves Glass Animals 1.0 89.0 1.0 80.87
I Hate U SZA 19.0 79.0 50.0 106.703
Late Night Talking Harry Styles 2.0 93.0 25.0 114.996
Need to Know Doja Cat 6.0 85.0 18.0 130.041
Woman Doja Cat 6.0 89.0 21.0 107.998

Songs Appearing on Charts

How many songs appeared on all three charts? How many on just TikTok and Spotify? How many on just Billboard and Spotify? etc.

To calculate these values, we first created new columns (one for each chart) that held either TRUE or FALSE depending on whether the song appeared on the corresponding chart. These new columns allowed us to use the summarize function to calculate how many true values appeared in each combination of columns. The results count how many songs appear on the various combinations of charts.

combined_cleaned |>
  mutate( #create new columns that identify whether a song appears on the chart using TRUE/FALSE
    on_tiktok = !is.na(rank_t), #identifies the non-NA songs appearing on the tiktok chart
    on_spotify = !is.na(rank_s), #identifies the non-NA songs appearing on the spotify
    on_billboard = !is.na(rank_b) #identifies the non-NA songs appearing on the billboard chart
  ) |>
  summarize( #calculates the total number of TRUE values for each combination of charts
    only_tiktok = sum(on_tiktok & !on_spotify & !on_billboard),
    only_spotify = sum(!on_tiktok & on_spotify & !on_billboard),
    only_billboard = sum(!on_tiktok & !on_spotify & on_billboard),
    only_tiktok_and_spotify = sum(on_tiktok & on_spotify & !on_billboard),
    only_tiktok_and_billboard = sum(on_tiktok & !on_spotify & on_billboard),
    only_spotify_and_billboard = sum(!on_tiktok & on_spotify & on_billboard),
    all_three = sum(on_tiktok & on_spotify & on_billboard)
  )

We came up with two different methods to find the answer. The first uses CTEs and pivots to calculate the counts.

CREATE OR REPLACE TABLE total_counts_with AS
WITH cte_1 AS(
-- unpivot the table to make it longer
  UNPIVOT combined_cleaned
    ON rank_s, rank_t, rank_b
    INTO
      NAME charts
      VALUE rank
), cte_2 AS(
-- the value in the rank column does not matter for this analysis, so we did not include it in SELECT
  SELECT song_title, artist, charts, tempo
  FROM cte_1
), cte_3 AS(
-- pivot to make the table wider
  PIVOT_WIDER cte_2
  ON charts
  USING first(charts)
), cte_4 AS(
-- concatenate the three charts into one column
  SELECT CONCAT(rank_s, ', ', rank_t, ', ', rank_b) AS charts
  FROM cte_3
)
-- group by the new charts column and count how many songs appear on each combination of charts
  SELECT charts, COUNT(*) AS counts
  FROM cte_4
  GROUP BY charts

The second method uses COUNT(*), FILTER, and WHERE. Each line counts the number of rows that meet the condition in WHERE.

CREATE OR REPLACE TABLE total_counts AS
SELECT COUNT(*) FILTER (WHERE rank_s IS NOT NULL AND rank_t IS NULL AND rank_b IS NULL) AS only_s,
       COUNT(*) FILTER (WHERE rank_s IS NULL AND rank_t IS NOT NULL AND rank_b IS NULL) AS only_t,
       COUNT(*) FILTER (WHERE rank_s IS NULL AND rank_t IS NULL AND rank_b IS NOT NULL) AS only_b,
       COUNT(*) FILTER (WHERE rank_s IS NOT NULL AND rank_t IS NOT NULL AND rank_b IS NULL) AS only_spotify_and_tiktok,
       COUNT(*) FILTER (WHERE rank_s IS NULL AND rank_t IS NOT NULL AND rank_b IS NOT NULL) AS only_tiktok_and_billboard,
       COUNT(*) FILTER (WHERE rank_s IS NOT NULL AND rank_t IS NULL AND rank_b IS NOT NULL) AS only_spotify_and_billboard,
       COUNT(*) FILTER (WHERE rank_s IS NOT NULL AND rank_t IS NOT NULL AND rank_b IS NOT NULL) AS all_three
FROM combined_cleaned

To create the output, we created a new dataframe with the labels as column headers and all the values set to zero. The we replaced each of the zeros with the appropriate counts. We used .sum() to count the non NA values for each column.

(combined.
  melt(id_vars = ["song_title"], var_name = "charts", value_name = "rank")
)
                                       song_title  ...                                     rank
0                                  'Til You Can't  ...                             Cody Johnson
1                                'Till I Collapse  ...                        Eminem, Nate Dogg
2     1, 2, 3 (feat. Jason Derulo & De La Ghetto)  ...                              Sofía Reyes
3     1, 2, 3 (feat. Jason Derulo & De La Ghetto)  ...  Sofía Reyes, Jason Derulo, De La Ghetto
4                      10 Things I Hate About You  ...                                Leah Kate
...                                           ...  ...                                      ...
4640                                      traitor  ...                                  100.607
4641            up at night (feat. justin bieber)  ...                                  100.012
4642                                    vice city  ...                                   87.023
4643                           you broke me first  ...                                  124.148
4644                                        İmdat  ...                                  133.967

[4645 rows x 3 columns]
data = {'only_spotify': [0], 'only_tiktok': [0], 'only_billboard': [0], 'only_spotify_and_tiktok': [0], 'only_tiktok_and_billboard': [0], 'only_spotify_and_billboard': [0], 'all_three': [0]}
totals = pd.DataFrame(data)

totals['only_spotify'] = [(combined['rank_s'].notna() &
                           combined['rank_t'].isna() &
                           combined['rank_b'].isna()).sum()]
totals['only_tiktok'] = [(combined['rank_s'].isna() &
                          combined['rank_t'].notna() &
                          combined['rank_b'].isna()).sum()]
totals['only_billboard'] = [(combined['rank_s'].isna() &
                             combined['rank_t'].isna() &
                             combined['rank_b'].notna()).sum()]
totals['only_spotify_and_tiktok'] = [(combined['rank_s'].notna() &
                                      combined['rank_t'].notna() &
                                      combined['rank_b'].isna()).sum()]
totals['only_tiktok_and_billboard'] = [(combined['rank_s'].isna() &
                                        combined['rank_t'].notna() &
                                        combined['rank_b'].notna()).sum()]
totals['only_spotify_and_billboard'] = [(combined['rank_s'].notna() &
                                         combined['rank_t'].isna() &
                                         combined['rank_b'].notna()).sum()]
totals['all_three'] = [(combined['rank_s'].notna() &
                        combined['rank_t'].notna() &
                        combined['rank_b'].notna()).sum()]

GT(totals)
only_spotify only_tiktok only_billboard only_spotify_and_tiktok only_tiktok_and_billboard only_spotify_and_billboard all_three
577 212 70 40 0 19 11

TikTok Songs and Tempo

Did popular songs on TikTok have higher tempos? Many viral trends on TikTok involve creators doing a short dance. We wonder if there is a relationship to the tempo of songs and what which songs are popular. We used a simple scatter plot to visualize the data.

combined_cleaned |>
  ggplot(data = combined_cleaned, mapping = aes(x = rank_t, y = tempo)) +
  geom_point() +
  labs(x = "Song Rank on TikTok", y = "Tempo of Song")

Nothing here since SQL doesn’t visualize.

combined.plot.scatter(x="rank_t", y="tempo").set(xlabel="Rank on TikTok", ylabel="Tempo (BPM)", title="Tempo of Popular Songs on TikTok")

As seen from the plot, there does not appear to be a correlation between songs being popular on TikTok and tempo. In the original data, some of the songs had a rank of 0. The documentation does not describe why that some songs included in the chart have a rank of 0.